 validation sample





Personalizing black-box models for nonparametric regression with minimax optimality

arXiv.org Machine Learning

Recent advances in large-scale models, including deep neural networks and large language models, have substantially improved performance across a wide range of learning tasks. The widespread availability of such pre-trained models creates new opportunities for data-efficient statistical learning, provided they can be effectively integrated into downstream tasks. Motivated by this setting, we study few-shot personalization, where a pre-trained black-box model is adapted to a target domain using a limited number of samples. We develop a theoretical framework for few-shot personalization in nonparametric regression and propose algorithms that can incorporate a black-box pre-trained model into the regression procedure. We establish the minimax optimal rate for the personalization problem and show that the proposed method attains this rate. Our results clarify the statistical benefits of leveraging pre-trained models under sample scarcity and provide robustness guarantees when the pre-trained model is not informative. We illustrate the finite-sample performance of the methods through simulations and an application to the California housing dataset with several pre-trained models.
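
One way to picture few-shot personalization of a black-box model is a residual correction: keep the pre-trained predictor fixed and fit a nonparametric smoother to its errors on the few target samples. The sketch below is only an illustration of that idea and is not the algorithm proposed in the paper. The function name `personalize`, the Gaussian kernel, the scalar inputs, and the `bandwidth` parameter are all assumptions made for this example.

```python
import math

def personalize(black_box, xs, ys, bandwidth=0.5):
    """Return a predictor: black_box(x) plus a kernel-smoothed residual correction.

    Hypothetical helper for illustration; assumes scalar inputs and a
    Gaussian (Nadaraya-Watson) smoother, neither of which comes from the paper.
    """
    # Errors of the fixed pre-trained model on the few target samples
    residuals = [y - black_box(x) for x, y in zip(xs, ys)]

    def predict(x):
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs]
        total = sum(weights)
        if total == 0.0:
            # No target sample is close enough: fall back to the black box alone
            return black_box(x)
        correction = sum(w * r for w, r in zip(weights, residuals)) / total
        return black_box(x) + correction

    return predict

# Toy usage: the black box is off by a constant shift on the target domain
predict = personalize(lambda x: 2.0 * x, xs=[0.0, 1.0, 2.0], ys=[1.0, 3.0, 5.0])
```

When the black box is already accurate the residuals are small and the correction does little, which loosely mirrors the benefit the abstract describes under sample scarcity.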


TED++: Submanifold-Aware Backdoor Detection via Layerwise Tubular-Neighbourhood Screening

arXiv.org Artificial Intelligence

As deep neural networks power increasingly critical applications, stealthy backdoor attacks, where poisoned training inputs trigger malicious model behaviour while appearing benign, pose a severe security risk. Many existing defences are vulnerable when attackers exploit subtle distance-based anomalies or when clean examples are scarce. To meet this challenge, we introduce TED++, a submanifold-aware framework that effectively detects subtle backdoors that evade existing defences. TED++ begins by constructing a tubular neighbourhood around each class's hidden-feature manifold, estimating its local "thickness" from a handful of clean activations. It then applies Locally Adaptive Ranking (LAR) to detect any activation that drifts outside the admissible tube. By aggregating these LAR-adjusted ranks across all layers, TED++ captures how faithfully an input remains on the evolving class submanifolds. Based on such characteristic "tube-constrained" behaviour, TED++ flags inputs whose LAR-based ranking sequences deviate significantly. Extensive experiments are conducted on benchmark datasets and tasks, demonstrating that TED++ achieves state-of-the-art detection performance under both adaptive-attack and limited-data scenarios. Remarkably, even with only five held-out examples per class, TED++ still delivers near-perfect detection, achieving gains of up to 14% in AUROC over the next-best method. The code is publicly available at https://github.com/namle-w/TEDpp.



Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles

arXiv.org Artificial Intelligence

Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.


Bridging Econometrics and AI: VaR Estimation via Reinforcement Learning and GARCH Models

arXiv.org Artificial Intelligence

Context: Forecasting stock returns is a long-standing challenge in financial economics, with significant implications for both risk management and regulatory compliance. Traditional econometric models such as GARCH (Bollerslev, 1986) capture volatility persistence but fail to fully account for key stylized facts of financial time series: fat tails, volatility clustering, and leverage effects (Glosten et al., 1993). Similarly, modern machine learning and deep learning methods, although capable of modeling nonlinear dynamics (Goodfellow et al., 2016; Tealab, 2018), tend to underperform during rare but impactful market shocks (Fawcett and Provost, 1997; Pokou, 2022). As illustrated in Figure 1, these limitations often result in systematic mispredictions of excess returns, especially in turbulent markets. These forecasting inaccuracies are critical because they directly translate into unreliable estimates of Value-at-Risk (VaR), the benchmark risk measure under Basel regulatory frameworks (Basel Committee on Banking Supervision, 2017). Overestimation inflates capital requirements, whereas underestimation exposes institutions to excessive losses. To mitigate these shortcomings, the recent literature has shifted from precise return forecasting to directional return prediction, reframing the task as a classification problem that determines whether returns will be positive or negative (Kanas, 2001; Nyberg, 2011; Alostad and Davulcu, 2017). Beyond the standard zero threshold, quantile and volatility-based criteria have been introduced to better isolate significant market movements (Chung and Hong, 2007; Linton and Whang, 2007).
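
To make the link between a volatility model and VaR concrete, here is a minimal sketch of the standard GARCH(1,1) variance recursion followed by a parametric VaR. It is not the hybrid reinforcement-learning method of the paper. The Gaussian return assumption, the function names, and the parameter values in the usage lines are all assumptions chosen for illustration.

```python
from statistics import NormalDist

def garch_variance_path(returns, omega, alpha, beta):
    """Conditional variances from the GARCH(1,1) recursion
    sigma2[t+1] = omega + alpha * r[t]**2 + beta * sigma2[t].

    The last element is the one-step-ahead forecast.
    """
    if alpha + beta >= 1.0:
        raise ValueError("alpha + beta must be below 1 for a finite unconditional variance")
    # Initialise at the unconditional variance (a common convention, assumed here)
    sigma2 = [omega / (1.0 - alpha - beta)]
    for r in returns:
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2

def gaussian_var(sigma2_next, level=0.99, mean=0.0):
    """One-day VaR as a positive loss, assuming normally distributed returns."""
    z = NormalDist().inv_cdf(1.0 - level)  # lower-tail quantile, negative for level > 0.5
    return -(mean + z * sigma2_next ** 0.5)

# Illustrative parameters only, not estimates from any dataset
path = garch_variance_path([0.01, -0.02, 0.015], omega=1e-5, alpha=0.1, beta=0.85)
var_99 = gaussian_var(path[-1], level=0.99)
```

Under the Gaussian assumption the VaR scales linearly with the forecast volatility, so an error in the variance forecast carries over proportionally to the risk estimate, which is the sensitivity the abstract points to.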


Acoustic evaluation of a neural network dedicated to the detection of animal vocalisations

arXiv.org Artificial Intelligence

The accessibility of long-duration recorders, adapted to sometimes demanding field conditions, has enabled the deployment of extensive animal population monitoring campaigns through ecoacoustics. The effectiveness of automatic signal detection methods, increasingly based on neural approaches, is frequently evaluated solely through machine learning metrics, while acoustic analysis of performance remains rare. As part of the acoustic monitoring of Rock Ptarmigan populations, we propose here a simple method for acoustic analysis of the detection system's performance. The proposed measure is based on relating the signal-to-noise ratio of synthetic signals to their probability of detection. We show how this measure provides information about the system and allows optimisation of its training. We also show how it enables modelling of the detection distance, thus offering the possibility of evaluating its dynamics according to the sound environment and accessing an estimation of the spatial density of calls.
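
The measure described above, relating the signal-to-noise ratio of synthetic signals to their probability of detection, can be sketched as an empirical detection curve. The code below is an illustrative assumption about how such a curve might be tabulated, not the authors' implementation. The function names, the `(snr_db, detected)` trial format, and the linear interpolation are hypothetical.

```python
from collections import defaultdict

def detection_probability_by_snr(trials):
    """Map each SNR level (dB) to the fraction of synthetic signals detected.

    trials: iterable of (snr_db, detected) pairs, one per injected signal.
    """
    counts = defaultdict(lambda: [0, 0])  # snr_db -> [detections, total]
    for snr_db, detected in trials:
        counts[snr_db][0] += int(detected)
        counts[snr_db][1] += 1
    return {snr: hits / total for snr, (hits, total) in sorted(counts.items())}

def snr_at_probability(curve, target=0.5):
    """Linearly interpolate the SNR at which the curve first rises to `target`.

    Returns None if the curve never crosses `target` between two measured levels.
    """
    points = sorted(curve.items())
    for (s0, p0), (s1, p1) in zip(points, points[1:]):
        if p0 < target <= p1:
            return s0 + (target - p0) * (s1 - s0) / (p1 - p0)
    return None

# Toy trials: detector output for synthetic calls injected at three SNR levels
trials = [(-10, False), (-10, False), (0, True), (0, False), (10, True), (10, True)]
curve = detection_probability_by_snr(trials)
threshold_snr = snr_at_probability(curve, target=0.5)
```

A threshold SNR of this kind could then be combined with a sound propagation model to estimate a detection distance, though that step depends on assumptions about the environment that are not covered here.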


Large Language Models: An Applied Econometric Framework

arXiv.org Artificial Intelligence

How can we use the novel capacities of large language models (LLMs) in empirical research? And how can we do so while accounting for their limitations, which are themselves only poorly understood? We develop an econometric framework to answer this question that distinguishes between two types of empirical tasks. Using LLMs for prediction problems (including hypothesis generation) is valid under one condition: no "leakage" between the LLM's training dataset and the researcher's sample. No leakage can be ensured by using open-source LLMs with documented training data and published weights. Using LLM outputs for estimation problems to automate the measurement of some economic concept (expressed either by some text or from human subjects) requires the researcher to collect at least some validation data: without such data, the errors of the LLM's automation cannot be assessed and accounted for. As long as these steps are taken, LLM outputs can be used in empirical research with the familiar econometric guarantees we desire. Using two illustrative applications to finance and political economy, we find that these requirements are stringent; when they are violated, the limitations of LLMs now result in unreliable empirical estimates. Our results suggest the excitement around the empirical uses of LLMs is warranted -- they allow researchers to effectively use even small amounts of language data for both prediction and estimation -- but only with these safeguards in place.
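
The role of validation data in estimation problems can be illustrated with a simple bias correction: measure the LLM's average error on a small human-labelled validation sample and subtract it from the LLM-based estimate on the full sample. This is a minimal sketch of the general idea and is not the estimator developed in the paper. The function name and the assumption that the validation sample is representative of the full sample are introduced here for illustration.

```python
from statistics import mean

def corrected_mean(llm_all, llm_val, human_val):
    """Estimate a population mean from LLM labels, debiased with a validation sample.

    llm_all:   LLM measurements on the full sample
    llm_val:   LLM measurements on the validation items
    human_val: human (reference) measurements on the same validation items

    Hypothetical helper; assumes the validation items are drawn at random
    from the same population as the full sample.
    """
    if not llm_val or len(llm_val) != len(human_val):
        raise ValueError("validation samples must be non-empty and paired")
    # Average LLM error relative to the human reference labels
    bias = mean([l - h for l, h in zip(llm_val, human_val)])
    return mean(llm_all) - bias

# Toy usage: the LLM over-reports the concept on half of the validation items
estimate = corrected_mean(llm_all=[1, 1, 0, 1], llm_val=[1, 1], human_val=[1, 0])
```

Without `llm_val` and `human_val` the bias term cannot be computed at all, which is the point the abstract makes about errors that cannot be assessed without validation data.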